class: center, middle, inverse, title-slide # Introduction to Survey Data Cleaning Using Tidyverse in R ## Data Wrangling - Part 1 ### Johannes Breuer
Stefan Jünger ### 2021-07-22 --- layout: true <div class="my-footer"> <div style="float: left;"><span>Johannes Breuer, Stefan Jünger</span></div> <div style="float: right;"><span>ESRA 2021, 2021-07-22</span></div> <div style="text-align: center;"><span>Data Wrangling - Part 1</span></div> </div> --- ## Data wrangling 🤠 <img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\tidyverse-workshop-esra-2021\content\img\data_cowboy.png" width="95%" style="display: block; margin: auto;" /> <small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small> --- ## What is data wrangling? Data wrangling is the process of "getting the data into shape", so that you can then explore and analyze them. Common data wrangling steps when working with survey data include: - **renaming** variables - **recoding** variables/values - **creating/computing** new variables - **selecting** a subset of variables - **filtering** a subset of cases -- The (in)famous **80/20-rule**: 80% wrangling, 20% analysis<sup>1</sup> .footnote[ [1] Of course, this ratio relates to the time required for writing the code, not the computing time. ] --- ## Data wrangling Simply put, data wrangling means... ```r wrangled_data <- original_data %>% do_something() ``` This is where the `dplyr` package come in. <img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\tidyverse-workshop-esra-2021\content\img\dplyr.png" width="25%" style="display: block; margin: auto;" /> --- ## `dplyr` functions - `dplyr` functions are verbs that signal an action - first argument = a dataframe - output normally also a dataframe (tibble) - columns (= variables in a tidy dataframe) can be referenced without quotation marks (non-standard evaluation) - actions (verbs) can be applied to columns (variables) and rows (cases/observations) --- ## Load the data Before we can wrangle the data, we need to load it. We will work with the `csv` version of the synthetic data. For knowing which variables we will be working with, it helps to consult the [codebook](https://dbk.gesis.org/dbksearch/download.asp?id=67378) for the original data. .small[ ```r gpc <- read_csv("./data/ZA5667_v1-0-0_Stata14_synthetic-data.csv") ``` ] **NB**: This command uses a relative path that assumes that the working directory is the folder containing the workshop materials. --- ## Note: Tidy vs. untidy data A lot of work (by many people) has already gone into the data that these synthetic data set is based on set, so these data are already tidy. If you collect data yourself, this may not be the case (at least for the raw data). For example, cells may hold more than one value or a variable that should be in one column is spread across multiple columns (e.g., parts of a date or name). If you need to make your data tidy or change it from wide to long format or vice versa (which may, e.g., be necessary if you work with longitudinal survey data from multiple waves), the [`tidyr` package](https://tidyr.tidyverse.org/) from the `tidyverse` is a good option. --- ## Selecting variables We might want to reduce our dataframe (or create a new one) to only include a subset of specific variables. This can be done with the `dplyr` verb `select`. ```r gpc_risk <- gpc %>% select(hzcy001a, hzcy002a, hzcy003a, hzcy004a, hzcy005a) head(gpc_risk) ``` ``` ## # A tibble: 6 x 5 ## hzcy001a hzcy002a hzcy003a hzcy004a hzcy005a ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 4 6 3 6 4 ## 2 4 6 6 6 4 ## 3 2 2 2 2 2 ## 4 NA NA NA NA NA ## 5 6 6 4 6 6 ## 6 4 4 3 4 4 ``` --- ## Selecting variables There also is a shorthand notation for selecting a set of consecutive columns with `select()`. ```r gpc_risk <- gpc %>% select(hzcy001a:hzcy005a) head(gpc_risk) ``` ``` ## # A tibble: 6 x 5 ## hzcy001a hzcy002a hzcy003a hzcy004a hzcy005a ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 4 6 3 6 4 ## 2 4 6 6 6 4 ## 3 2 2 2 2 2 ## 4 NA NA NA NA NA ## 5 6 6 4 6 6 ## 6 4 4 3 4 4 ``` --- ## Unselecting variables `select()` from `dplyr` also allows you to easily exclude one or more columns/variables. ```r gpc_risk_self <- gpc_risk %>% select(-c(hzcy002a, hzcy005a)) head(gpc_risk_self) ``` ``` ## # A tibble: 6 x 3 ## hzcy001a hzcy003a hzcy004a ## <dbl> <dbl> <dbl> ## 1 4 3 6 ## 2 4 6 6 ## 3 2 2 2 ## 4 NA NA NA ## 5 6 4 6 ## 6 4 3 4 ``` --- ## Advanced ways of selecting variables `dplyr` offers several helper functions for selecting variables. For a full list of those, you can check the [documentation for the `select()` function](https://dplyr.tidyverse.org/reference/select.html). ```r gpc_cy <- gpc %>% select(starts_with("hzcy")) gpc_cat <- gpc %>% select(ends_with("_cat")) glimpse(gpc_cat) ``` ``` ## Rows: 3,765 ## Columns: 2 ## $ age_cat <dbl> 10, 2, 8, 1, 7, 7, 7, 7, 8, 6, 9, 7, 2, 2, 7, 7, 7, 4, 10, 7, 4, 2, 4, 9, 9, 8, 10, 2, 7, 10, 7, 7, 7, 3, 7, 1~ ## $ education_cat <dbl> 3, 3, 1, 3, 3, 2, 3, 3, 3, 2, 2, 2, 3, 3, 2, 3, 2, 2, 2, 2, 3, 3, 2, 2, 3, 3, 3, 3, 2, 3, 2, 3, 2, 3, 3, 2, 2,~ ``` *Note*: You can also consult the help file for the function in `R`/*RStudio* through the command `?select`. --- ## Advanced ways of selecting variables Another particularly useful selection helper is `where()`. You can, e.g., use `where()` to select only a specific type of variables, such as `select(where(is.numeric))`. As the synthetic data set we use in our course only includes numeric variables, however, we will, instead, use `where()` to select only binary variables (that have a *min* of 0 and a *max* of 1). ```r gpc_bin <- gpc %>% * select(where(~ min(.x, na.rm = T) == 0 & max(.x, na.rm = T) == 1)) gpc_bin %>% select(1:5) %>% glimpse() ``` ``` ## Rows: 3,765 ## Columns: 5 ## $ hzcy006a <dbl> 1, 1, 1, NA, 1, 1, 1, NA, NA, 1, 1, NA, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, NA, 1, 1, NA, 0, 1, 1, 0, NA, NA, 1,~ ## $ hzcy007a <dbl> 1, 1, 1, NA, 0, 1, 1, NA, NA, 0, 1, NA, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, NA, 1, 1, NA, 0, 1, 1, 1, NA, NA, 1,~ ## $ hzcy008a <dbl> 0, 0, 1, NA, 0, 0, 1, NA, NA, 0, 1, NA, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, NA, 0, 1, NA, 0, 1, 1, 0, NA, NA, 0,~ ## $ hzcy009a <dbl> 0, 0, 0, NA, 0, 0, 0, NA, NA, 0, 0, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, NA, 0, 0, 0, 0, NA, NA, 0,~ ## $ hzcy010a <dbl> 0, 1, 0, NA, 0, 0, 0, NA, NA, 0, 0, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, NA, 0, 1, 0, 0, NA, NA, 0,~ ``` --- ## Renaming variables As the variable names are not self-explanatory, we might want to change them. We will use 🐍 *snake_case*<sup>1</sup> in our examples. ```r gpc_risk <- gpc_risk %>% rename(risk_self = hzcy001a, # new_name = old_name risk_surroundings = hzcy002a, risk_hospital = hzcy003a, risk_quarantine = hzcy004a, risk_infect_others = hzcy005a) names(gpc_risk) ``` ``` ## [1] "risk_self" "risk_surroundings" "risk_hospital" "risk_quarantine" "risk_infect_others" ``` .footnote[ [1] For a good introduction to naming schemes that avoid spaces, you can check out the [Medium post by Patrick Divine](https://medium.com/@pddivine/string-case-styles-camel-pascal-snake-and-kebab-case-981407998841). ] --- ## Renaming variables For some more advanced renaming options, you can use the `dplyr` function `rename_with()`. ```r gpc_risk <- gpc_risk %>% rename_with(toupper) names(gpc_risk) ``` ``` ## [1] "RISK_SELF" "RISK_SURROUNDINGS" "RISK_HOSPITAL" "RISK_QUARANTINE" "RISK_INFECT_OTHERS" ``` *Note*: The [`janitor` package](https://github.com/sfirke/janitor) (which is `tidyverse`-oriented) can be used to facilitate several common data cleaning tasks, including renaming variables. --- ## Re~~wind~~name selecta You can also use `select` to select and rename variables in one step. ```r gpc_risk <- gpc %>% select(risk_self = hzcy001a, risk_surroundings = hzcy002a, risk_hospital = hzcy003a, risk_quarantine = hzcy004a, risk_infect_others = hzcy005a) names(gpc_risk) ``` ``` ## [1] "risk_self" "risk_surroundings" "risk_hospital" "risk_quarantine" "risk_infect_others" ``` --- ## Moving columns Although the positions of columns in a dataframe do not matter for analyses or plotting (unless you want to select columns using their numerical index), you might want to change them. For this purpose, `dplyr` provides the `relocate()` function. ```r gpc_risk <- gpc_risk %>% relocate(risk_infect_others, .after = risk_surroundings) glimpse(gpc_risk) ``` ``` ## Rows: 3,765 ## Columns: 5 ## $ risk_self <dbl> 4, 4, 2, NA, 6, 4, 4, NA, NA, 4, 3, NA, 5, 4, 6, 5, 2, 6, 2, 6, 5, 5, 4, 3, 4, NA, 2, 5, NA, 1, 3, NA, 4,~ ## $ risk_surroundings <dbl> 6, 6, 2, NA, 6, 4, 4, NA, NA, 6, 3, NA, 5, 4, 6, 5, 3, 6, 3, 5, 4, 6, 5, 4, 4, NA, 3, 5, NA, 1, 6, 4, 4, ~ ## $ risk_infect_others <dbl> 4, 4, 2, NA, 6, 4, 4, NA, NA, 7, 1, NA, 4, 4, 4, 5, 2, 5, 2, 5, 4, 5, 6, 3, 4, NA, 3, 5, NA, 3, 3, NA, 3,~ ## $ risk_hospital <dbl> 3, 6, 2, NA, 4, 3, 3, NA, NA, 3, 4, NA, 3, 3, 5, 4, 2, 2, 2, 4, 2, 3, 3, 2, 4, NA, 3, 4, NA, 1, 3, NA, 3,~ ## $ risk_quarantine <dbl> 6, 6, 2, NA, 6, 4, 4, NA, NA, 6, 2, NA, 5, 4, 5, 5, 2, 4, 2, 3, 5, 6, 5, 3, 3, NA, 4, 6, NA, 3, 3, NA, 4,~ ``` *Note*: You can also move a column before a specific other column by providing a variable name to the `.before` argument (instead of `.after`). --- ## `dplyr::relocate()` <img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\tidyverse-workshop-esra-2021\content\img\dplyr_relocate.png" width="85%" style="display: block; margin: auto;" /> <small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small> --- ## Filtering rows/observations In `R`, you can filter rows/observations dependent on one or more conditions. To filter rows/observations you can use... - **comparison operators**: - **<** (smaller than) - **<=** (smaller than or equal to) - **==** (equal to) - **!=** (not equal to) - **>=** (larger than or equal to) - **>** (larger than) - **%in%** (included in) ... and combine them with - **logical operators**: - **&** (and) - **|** (or) - **!** (not) - **xor** (either or, not both) --- ## Filtering rows/observations The `dplyr` solution for filtering rows/observations is the verb `filter()`. ```r gpc_male <- gpc %>% filter(sex == 1) dim(gpc_male) ``` ``` ## [1] 1933 111 ``` --- ## Filtering rows based on multiple conditions ```r gpc_old_men <- gpc %>% filter(sex == 1, age_cat > 7) dim(gpc_old_men) ``` ``` ## [1] 643 111 ``` --- ## `dplyr::filter` - multiple conditions By default, multiple conditions in `filter()` are added as & (and). You can, however, also specify multiple conditions differently. **or** (cases for which at least one of the conditions is true) .small[ ```r gpc_old_andor_male <- gpc %>% filter(sex == 1 | age_cat > 7) dim(gpc_old_andor_male) ``` ``` ## [1] 2427 111 ``` ] **xor** (cases for which only one of the two conditions is true) .small[ ```r gpc_old_or_male <- gpc %>% filter(xor(sex == 1, age_cat > 7)) dim(gpc_old_or_male) ``` ``` ## [1] 1784 111 ``` ] --- ## `dplyr::filter()` <img src="data:image/png;base64,#C:\Users\breuerjs\Documents\Lehre\tidyverse-workshop-esra-2021\content\img\dplyr_filter.jpg" width="95%" style="display: block; margin: auto;" /> <small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small> --- ## Advanced ways of filtering observations .small[ Similar to `select()` there are some helper functions for `filter()` for advanced filtering of rows. For example, you can... - Filter rows based on a range in a numeric variable ```r gpc_centrist <- gpc %>% filter(between(political_orientation, 4, 6)) dim(gpc_centrist) ``` ``` ## [1] 2050 111 ``` *Note*: The range specified in `between()` is inclusive (on both sides). - Filter rows based on the values of specific variables matching certain criteria ```r gpc_risk_low <- gpc_risk %>% filter(if_all(everything(), ~ . < 4)) # read: if the values of all vars in this df are < 4 dim(gpc_risk_low) ``` ``` ## [1] 294 5 ``` *Note*: The helper function `if_any()` can be used to specify that at least one of the variables needs to match a certain criterion. ] --- ## (Re-)Arranging the order of rows Again, while this does not directly matter for analyses or plotting (unless you want to filter rows by their numeric index), you can rearrange the order of rows in a data set. The appropriate `dplyr` verb for this is `arrange()`. ```r gpc %>% * arrange(age_cat) %>% select(sex:household) %>% glimpse() ``` ``` ## Rows: 3,765 ## Columns: 8 ## $ sex <dbl> 2, 2, 1, 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 2, 1, ~ ## $ age_cat <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~ ## $ education_cat <dbl> 3, 3, 3, 3, 1, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 2, 3, 3, 2, 2, 3, 1, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, ~ ## $ intention_to_vote <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 97, 2, NA, 2, 2, ~ ## $ choice_of_party <dbl> 98, 98, 7, 5, 98, 5, 7, 4, 6, 5, 98, 5, NA, 2, 1, 2, 2, 5, 6, 1, 98, 6, 2, 1, 4, 4, 5, 98, 97, 5, NA, ~ ## $ political_orientation <dbl> 1, 4, 7, 5, 5, 4, 3, 3, 3, 2, 5, 3, 5, 3, 5, 4, 3, 4, 8, 6, 0, 6, 5, 7, 1, 4, 1, 6, 10, 4, 5, 6, 5, 6,~ ## $ marstat <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~ ## $ household <dbl> 3, 3, 3, 3, 3, 3, 3, 2, 3, 2, 2, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~ ``` --- ## (Re-)Arranging the order of rows Of course, it is also possible to sort a dataframe in descending order of a variable. ```r gpc %>% * arrange(desc(age_cat)) %>% select(sex:household) %>% glimpse() ``` ``` ## Rows: 3,765 ## Columns: 8 ## $ sex <dbl> 1, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, ~ ## $ age_cat <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10~ ## $ education_cat <dbl> 3, 2, 3, 3, 2, 1, 3, 3, 3, 3, 1, 2, 3, 2, 1, 2, 1, 2, 3, 3, 3, 3, 1, 1, 3, 3, 3, 2, 3, 2, 3, 1, 1, 3, ~ ## $ intention_to_vote <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,~ ## $ choice_of_party <dbl> 98, 1, 5, 5, 6, 3, 5, 98, 3, 2, 6, 4, 3, 2, 4, 1, 4, 98, 2, NA, 98, 4, 1, 1, 3, 1, 5, 1, 2, 3, 6, 1, 2~ ## $ political_orientation <dbl> 7, 8, 2, 3, 5, 5, 2, 4, 7, 5, 5, 5, 7, 5, 6, 7, 5, 5, 3, 3, 6, 3, 5, 8, 6, 3, 6, 4, 4, 5, 7, 5, 9, 6, ~ ## $ marstat <dbl> 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 4, 3, 3, 1, 4, 2, 1, 4, 1, 4, 1, 1, 1, 1, 4, 2, 1, 2, 1, 1, 1, ~ ## $ household <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, ~ ``` *Note*: You can also use `-` instead of `desc()`, if you sort by numeric variables. --- ## (Re-)Arranging the order of rows dataframes can also be sorted by more than one variable. ```r gpc %>% * arrange(age_cat, education_cat) %>% select(sex:household) %>% glimpse() ``` ``` ## Rows: 3,765 ## Columns: 8 ## $ sex <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 1, ~ ## $ age_cat <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~ ## $ education_cat <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, ~ ## $ intention_to_vote <dbl> 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA, 2, 2, 97, 2, 2, 2, 2, 2, 2, 2, 2~ ## $ choice_of_party <dbl> 98, 6, 4, 7, 98, 98, 98, 5, 6, 2, 2, 6, 1, 4, 2, 4, 6, 1, 5, 2, 98, 1, NA, 98, 1, 98, 6, 98, 98, 7, 5,~ ## $ political_orientation <dbl> 5, 6, 3, 6, 3, 5, 3, 3, 3, 3, 4, 8, 6, 4, 5, 3, 6, 2, 3, 4, 2, 5, 2, 6, 2, 0, 3, 1, 4, 7, 5, 4, 3, 3, ~ ## $ marstat <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~ ## $ household <dbl> 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, ~ ``` --- class: center, middle # [Exercise](https://jobreu.github.io/tidyverse-workshop-esra-2021/exercises/Exercise_3.html) time 🏋️♀️💪🏃🚴 ## [Solutions](https://jobreu.github.io/tidyverse-workshop-esra-2021/solutions/Exercise_3.html)